This is the second assignment for the BCB420H1 course @ U of T. In this assignment, we will be parsing through our cleaned and normalized dataset generated in A1, and performing tests to identify differentially expressed genes and perform preliminary over-representation analyses.
First, let’s load in all of our libraries.
We are working with the GSE205517 dataset. This dataset and its accompanying study investigates the differentiation of hPSCs into left ventricular cardiomyocytes, and compares them to patient derived samples. (Dark et al. 2023) The study compares the transcriptomes of both conditions in order to determine similarity for potential therapeutic purposes. This study is a SuperSeries containing the SubSeries: - GSE203375, containing the hPSCs - GSE204885, containing the patient samples We will download both. In the previous assignment, we performed normalization on the RNA-Seq data, along with remapping HGNC symbols to avoid collisions. In this assignment, we will be perform differential gene expression (DGE) analysis, along with over-representation analysis (ORA). Before we begin these new steps, we will run the data acquisition, data overview, and normalization steps, and overview stats from assignment 1.
First, let’s perform our data acquisition step using GEOQuery. (Davis and Meltzer 2007) We’ll be re-using code blocks from A1.
# GEO Accession numbers
geo_acc_1 <- "GSE203375"
geo_acc_2 <- "GSE204885"
# The filenames for saving/loading data
filename_1 <- paste0(geo_acc_1, ".RData")
filename_2 <- paste0(geo_acc_2, ".RData")
# Reading in files from the GEO or locally
if (!file.exists(filename_1)) {
gset_1 <- getGEO(geo_acc_1, GSEMatrix=TRUE, getGPL=FALSE)
saveRDS(gset_1, filename_1)
} else {
gset_1 <- readRDS(filename_1)
}
gset_1 <- gset_1[[1]]
if (!file.exists(filename_2)) {
gset_2 <- getGEO(geo_acc_2, GSEMatrix=TRUE, getGPL=FALSE)
saveRDS(gset_2, filename_2)
} else {
gset_2 <- readRDS(filename_2)
}
gset_2 <- gset_2[[1]]
We now have our data loaded in. We now need to retrieve the counts tables from the GEO.
# load counts table from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path_1 <- paste(urld, paste0("acc=", geo_acc_1), paste0("file=", geo_acc_1, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");
path_2 <- paste(urld, paste0("acc=", geo_acc_2), paste0("file=", geo_acc_2, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");
# This checks if we already have the previous matrices saved in or not. If we have a local copy, we don't re-download. This is a time-saving and efficiency measure.
if (!file.exists(paste0(geo_acc_1,"_raw_counts.RData"))) {
counts_data_1 <- as.matrix(data.table::fread(path_1, header=T, colClasses="integer"), rownames=1)
name_mapping <- setNames(gset_1@phenoData@data[["title"]], gset_1@phenoData@data[["geo_accession"]])
colnames(counts_data_1) <- name_mapping[colnames(counts_data_1)]
saveRDS(counts_data_1, paste0(geo_acc_1, "_raw_counts.RData"))
} else {
counts_data_1 <- readRDS(paste0(geo_acc_1, "_raw_counts.RData"))
}
# We do the same for our second dataset
if (!file.exists(paste0(geo_acc_2,"_raw_counts.RData"))) {
counts_data_2 <- as.matrix(data.table::fread(path_2, header=T, colClasses="integer"), rownames=1)
name_mapping <- setNames(gset_2@phenoData@data[["title"]], gset_2@phenoData@data[["geo_accession"]])
colnames(counts_data_2) <- name_mapping[colnames(counts_data_2)]
saveRDS(counts_data_2, paste0(geo_acc_2, "_raw_counts.RData"))
} else {
counts_data_2 <- readRDS(paste0(geo_acc_2, "_raw_counts.RData"))
}
counts_data <- cbind(counts_data_1, counts_data_2)
This excerpt is directly from A1: Some basic information about the dataset can be obtained through the ExpressionSet we’ve read in. The study investigated differentiation of two different cell lines, H9 of karyotype 46, XX, and MLC2V of karyotype 46, XY, corresponding to cells originating from female and male sources, respectively. It then compared to to patient samples harvested from left and right atrial and ventricular tissue, along with cardiomyocytes individually harvested from each section of the heart from three patients each.
Below are the experimental details provided in the gset objects for both series:
gset_1@experimentData@title # study title
## [1] "Bulk RNA-seq of a time series of hPSCs as they differentiate towards left ventricle cardiomyocytes"
gset_1@experimentData@abstract # study abstract
## [1] "We report that both hESCs and hiPSCs can be differentiated towrads left ventricle cardiomyocytes using an optimised protocol and that these cells transit via first heart field progenitors."
gset_1@experimentData@other$overall_design # summary of experiment
## [1] "Examination of the transcriptome of two hPSC lines as they differentiate towards left ventricle cardiomyocytes; critical time points were collected"
unique(gset_1@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_1@phenoData@data[["experiment number:ch1"]]) # heart
## [1] "h28" "h30" "h32" "h41" "h18" "h42"
unique(gset_1@phenoData@data[["cell_line:ch1"]]) # cell lines
## [1] "H9" "MLC2V"
unique(gset_1@phenoData@data[["day:ch1"]]) # day of differentiation
## [1] "0" "2" "4" "10" "15" "20" "40" "60" "6"
gset_1@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_1@experimentData@other$last_update_date # last update date
## [1] "Aug 08 2023"
unique(gset_1@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
gset_2@experimentData@title # study title
## [1] "Bulk-RNA-sequencing from chamber specific heart tissue and isolated cardiomyocytes"
gset_2@experimentData@abstract # study abstract
## [1] "Healthy hearts from warm cadavers not usable for heart transplants were identified and tissues were isolated from these. From ventricle tissues, cardiomyocytes were further isolated. Bulk-RNA-sequencing was performed."
gset_2@experimentData@other$overall_design # summary of experiment
## [1] "Bulk-RNA-sequencing from 3 different donors (tissue) and 3 other donors (isolated CMs)"
unique(gset_2@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_2@phenoData@data[["chamber:ch1"]]) # cell lines
## [1] "LV" "RV" "LA" "RA"
unique(gset_2@phenoData@data[["patient:ch1"]]) # patient
## [1] "P1" "P2" "P3" "P4" "P5" "P6"
unique(gset_2@phenoData@data[["tissue:ch1"]]) # tissue
## [1] "Ventricle Isolated Cells" "Heart tissue"
gset_2@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_2@experimentData@other$last_update_date # last update date
## [1] "Apr 26 2023"
unique(gset_2@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
sample_info_1 <- gset_1@phenoData@data[ , (ncol(gset_1@phenoData@data)-2):ncol(gset_1@phenoData@data)]
colnames(sample_info_1) <- gsub(":ch1", "", colnames(sample_info_1))
knitr::kable(sample_info_1, format = "pipe", caption = "<b>Table 1:</b> Sample information for Cardiomyocytes Derived from Stem Cells")
| cell_line | day | experiment number | |
|---|---|---|---|
| GSM6170585 | H9 | 0 | h28 |
| GSM6170586 | H9 | 2 | h28 |
| GSM6170587 | H9 | 4 | h28 |
| GSM6170588 | H9 | 10 | h28 |
| GSM6170589 | H9 | 15 | h28 |
| GSM6170590 | H9 | 20 | h28 |
| GSM6170591 | H9 | 40 | h28 |
| GSM6170592 | H9 | 60 | h28 |
| GSM6170593 | H9 | 0 | h30 |
| GSM6170594 | H9 | 2 | h30 |
| GSM6170595 | H9 | 4 | h30 |
| GSM6170596 | H9 | 6 | h30 |
| GSM6170597 | H9 | 10 | h30 |
| GSM6170598 | H9 | 15 | h30 |
| GSM6170599 | H9 | 20 | h30 |
| GSM6170600 | H9 | 40 | h30 |
| GSM6170601 | H9 | 60 | h30 |
| GSM6170602 | H9 | 0 | h32 |
| GSM6170603 | H9 | 2 | h32 |
| GSM6170604 | H9 | 4 | h32 |
| GSM6170605 | H9 | 6 | h32 |
| GSM6170606 | H9 | 15 | h32 |
| GSM6170607 | H9 | 20 | h32 |
| GSM6170608 | H9 | 40 | h32 |
| GSM6170609 | H9 | 60 | h32 |
| GSM6170610 | H9 | 0 | h41 |
| GSM6170611 | H9 | 2 | h41 |
| GSM6170612 | H9 | 4 | h41 |
| GSM6170613 | H9 | 6 | h41 |
| GSM6170614 | H9 | 10 | h41 |
| GSM6170615 | H9 | 15 | h41 |
| GSM6170616 | H9 | 20 | h41 |
| GSM6170617 | H9 | 40 | h41 |
| GSM6170618 | MLC2V | 0 | h18 |
| GSM6170619 | MLC2V | 2 | h18 |
| GSM6170620 | MLC2V | 4 | h18 |
| GSM6170621 | MLC2V | 6 | h18 |
| GSM6170622 | MLC2V | 10 | h18 |
| GSM6170623 | MLC2V | 15 | h18 |
| GSM6170624 | MLC2V | 20 | h18 |
| GSM6170625 | MLC2V | 40 | h18 |
| GSM6170626 | MLC2V | 60 | h18 |
| GSM6170627 | MLC2V | 0 | h42 |
| GSM6170628 | MLC2V | 2 | h42 |
| GSM6170629 | MLC2V | 4 | h42 |
| GSM6170630 | MLC2V | 6 | h42 |
| GSM6170631 | MLC2V | 10 | h42 |
| GSM6170632 | MLC2V | 15 | h42 |
| GSM6170633 | MLC2V | 20 | h42 |
| GSM6170634 | MLC2V | 40 | h42 |
| GSM6170635 | MLC2V | 60 | h42 |
sample_info_2 <- gset_2@phenoData@data[ , (ncol(gset_2@phenoData@data)-2):ncol(gset_2@phenoData@data)]
colnames(sample_info_2) <- gsub(":ch1", "", colnames(sample_info_2))
knitr::kable(sample_info_2, format = "pipe", caption = "<b>Table 2:</b> Sample information for Patient-Derived Samples")
| chamber | patient | tissue | |
|---|---|---|---|
| GSM6199297 | LV | P1 | Ventricle Isolated Cells |
| GSM6199298 | RV | P1 | Ventricle Isolated Cells |
| GSM6199299 | LV | P2 | Ventricle Isolated Cells |
| GSM6199300 | RV | P2 | Ventricle Isolated Cells |
| GSM6199301 | LV | P3 | Ventricle Isolated Cells |
| GSM6199302 | RV | P3 | Ventricle Isolated Cells |
| GSM6199303 | LA | P4 | Heart tissue |
| GSM6199304 | LV | P4 | Heart tissue |
| GSM6199305 | RA | P4 | Heart tissue |
| GSM6199306 | RV | P4 | Heart tissue |
| GSM6199307 | LA | P5 | Heart tissue |
| GSM6199308 | LV | P5 | Heart tissue |
| GSM6199309 | RA | P5 | Heart tissue |
| GSM6199310 | RV | P5 | Heart tissue |
| GSM6199311 | LA | P6 | Heart tissue |
| GSM6199312 | LV | P6 | Heart tissue |
| GSM6199313 | RA | P6 | Heart tissue |
| GSM6199314 | RV | P6 | Heart tissue |
To clean our data, we will do two things: 1. Filter out zero-expressors 2. Perform HGNC re-mapping 3. Normalize the data
From A1, we know that there are genes that minimally express throughout different conditions, and as a result, we will be filtering those out.
# Pick out all rows that have zeroes across all conditions
rows_to_keep <- apply(counts_data, 1, function(row) any(row != 0))
# Create new matrix containing only the rows that aren't all zeroes
counts_data_zero_filtered <- counts_data[rows_to_keep, ]
In line with the experimenters’ analyses, we will normalize the data together
First, let’s perform CPM filtering to remove low expressors.
# Minimum number of samples for the hPSC run
min_num_samples <- 2
counts_data_zero_filtered_matrix <- as.matrix(counts_data_zero_filtered)
# Get rid of low counts
# We add 0.1 for numerical stability and to prevent NA values when evaluating logs
keep = rowSums(log2(cpm(counts_data_zero_filtered_matrix+0.1)) > 1) > min_num_samples
filtered_counts_matrix = counts_data_zero_filtered_matrix[keep,]
After, we will use TMM from edgeR (Robinson, McCarthy, and Smyth 2010) to perform our normalization steps.
# We will make our DGEList with the filtered count matrices
d <- DGEList(counts=filtered_counts_matrix)
# Calculate normalization factors
d <- calcNormFactors(d)
# Apply normalization
normalized_counts <- cpm(d)
Normalization is now done, and we can move onto HGNC mapping.
The protocol followed for HGNC mapping is nearly identical to the first assignment. First, we will add a column to our normalized with the appropriate NCBI gene ID. While we do have the annotation table, it is also true that we were specified to still manually perform this step. We will download a copy of the HGNC dataset. (Tweedie et al. 2020) The dataset is confirmed to be up-to-date with the date of download.
# Download the HGNC file if it doesn't already exist in the cwd
if (!file.exists("hgnc_genes.RData")) {
hgnc_genes <- import_hgnc_dataset(file=latest_archive_url())
saveRDS(hgnc_genes, "hgnc_genes.RData")
} else {
hgnc_genes <- readRDS("hgnc_genes.RData")
}
hgnc_mapping <- hgnc_genes[, c('symbol', 'entrez_id')]
Now that we have a mapping, we can apply this to the table.
# Convert the normalized counts into a dataframe
normalized_counts_df <- as.data.frame(normalized_counts)
# Add rownames
normalized_counts_df$NCBI_gene_id <- rownames(normalized_counts_df)
# Map the entrez_id to the NCBI_gnes
labelled_counts_data <- merge(normalized_counts_df, hgnc_mapping, by.x = "NCBI_gene_id", by.y = 'entrez_id', all.x = TRUE)
labelled_counts_data <- labelled_counts_data[,c(1, ncol(labelled_counts_data), 3:ncol(labelled_counts_data)-1)]
knitr::kable(labelled_counts_data[c(1:10), c(1,2)], format = "pipe", caption = "<b>Table 3:</b> Table Matching First 10 NCBI Gene IDs to the Approved HGNC Symbols")
| NCBI_gene_id | symbol |
|---|---|
| 1 | A1BG |
| 100 | ADA |
| 1000 | CDH2 |
| 10000 | AKT3 |
| 100008587 | RNA5-8SN5 |
| 100008588 | RNA18SN5 |
| 100008589 | RNA28SN5 |
| 100009676 | ZBTB11-AS1 |
| 10001 | MED6 |
| 10003 | NAALAD2 |
labelled_counts_data <- labelled_counts_data[, -c(1)]
Now, we must deal with NA and empty values, along with
many-to-one mappings.
# Remove all NA labels
no_na_labelled_counts_data <- labelled_counts_data[!is.na(labelled_counts_data$symbol), ]
# Remove all empty strings symbols
no_emp_labelled_counts_data <- no_na_labelled_counts_data[no_na_labelled_counts_data$symbol != '', ]
rownames(no_emp_labelled_counts_data) <- no_emp_labelled_counts_data[, 1]
final_normalized_data <- no_emp_labelled_counts_data[, -c(1)]
Now that we have our data, we can look at a few counting statistics to gain insight into what we’re looking at.
# Create a list of overview stats
overview_stats <- list()
# Do it for each sample
for (sample_name in colnames(final_normalized_data)) {
sample_data <- final_normalized_data[, sample_name]
total_counts <- sum(sample_data)
mean_counts <- mean(sample_data)
median_counts <- median(sample_data)
sd_counts <- sd(sample_data)
var_counts <- var(sample_data)
overview_stats[[sample_name]] <- c(total_counts, mean_counts, median_counts, sd_counts, var_counts)
}
overview_stats_df <- do.call(rbind, overview_stats)
rownames(overview_stats_df) <- names(overview_stats)
colnames(overview_stats_df) <- c("Total Counts", "Mean Counts", "Median Counts", "Counts Standard Deviation", "Counts Variation")
knitr::kable(overview_stats_df[, ], format = "pipe", caption = "<b>Table 4:</b> Overview Statistics of Normalized Samples. The total counts are somewhat similar, but the mean counts are lower in the hPSC-derived samples. Mean counts are somewhat similar in both, but the median counts are greater in the hPSCs. The standard deviation and variation in the hPSCs is far lesser than that of the patient samples.")
| Total Counts | Mean Counts | Median Counts | Counts Standard Deviation | Counts Variation | |
|---|---|---|---|---|---|
| Heart 28 Day 0 H9 | 1021771.9 | 59.98426 | 12.24620 | 431.8453 | 186490.34 |
| Heart 28 Day 2 H9 | 786599.4 | 46.17819 | 13.63105 | 154.4348 | 23850.12 |
| Heart 28 Day 4 H9 | 716839.2 | 42.08285 | 13.82826 | 125.4171 | 15729.45 |
| Heart 28 Day 10 H9 | 793482.4 | 46.58227 | 13.73218 | 236.6323 | 55994.84 |
| Heart 28 Day 15 H9 | 745914.9 | 43.78977 | 14.72937 | 195.7736 | 38327.31 |
| Heart 28 Day 20 H9 | 993073.6 | 58.29949 | 14.57508 | 340.7407 | 116104.22 |
| Heart 28 Day 40 H9 | 1023479.0 | 60.08448 | 14.23242 | 342.5525 | 117342.18 |
| Heart 28 Day 60 H9 | 829562.9 | 48.70042 | 14.76538 | 223.1736 | 49806.45 |
| Heart 30 Day 0 H9 | 863916.5 | 50.71719 | 12.75642 | 218.8989 | 47916.72 |
| Heart 30 Day 2 H9 | 771994.0 | 45.32077 | 13.16358 | 169.0652 | 28583.03 |
| Heart 30 Day 4 H9 | 699609.0 | 41.07133 | 13.84685 | 136.2720 | 18570.05 |
| Heart 30 Day 6 H9 | 786225.2 | 46.15623 | 14.09130 | 184.6986 | 34113.57 |
| Heart 30 Day 10 H9 | 761009.1 | 44.67589 | 14.40243 | 189.8644 | 36048.49 |
| Heart 30 Day 15 H9 | 908915.2 | 53.35888 | 13.90010 | 360.0973 | 129670.08 |
| Heart 30 Day 20 H9 | 887759.9 | 52.11694 | 13.69645 | 273.7729 | 74951.58 |
| Heart 30 Day 40 H9 | 1010387.1 | 59.31591 | 13.56070 | 489.7979 | 239901.95 |
| Heart 30 Day 60 H9 | 920475.0 | 54.03751 | 14.47831 | 373.9279 | 139822.08 |
| Heart 32 Day 0 H9 | 779406.7 | 45.75594 | 13.18829 | 149.3232 | 22297.42 |
| Heart 32 Day 2 H9 | 770684.2 | 45.24387 | 13.37835 | 153.5595 | 23580.52 |
| Heart 32 Day 4 H9 | 702719.5 | 41.25393 | 13.64478 | 129.9497 | 16886.93 |
| Heart 32 Day 6 H9 | 772659.8 | 45.35985 | 14.03981 | 170.6316 | 29115.14 |
| Heart 32 Day 15 H9 | 844624.9 | 49.58465 | 13.86116 | 271.0413 | 73463.40 |
| Heart 32 Day 20 H9 | 831683.9 | 48.82493 | 14.49759 | 253.6298 | 64328.10 |
| Heart 32 Day 40 H9 | 805256.4 | 47.27347 | 15.13632 | 206.4093 | 42604.78 |
| Heart 32 Day 60 H9 | 1339814.7 | 78.65532 | 12.68147 | 1102.2826 | 1215026.94 |
| Heart 41 Day 0 H9 | 747252.1 | 43.86827 | 13.25882 | 164.7467 | 27141.46 |
| Heart 41 Day 2 H9 | 771285.9 | 45.27920 | 13.46844 | 181.9923 | 33121.19 |
| Heart 41 Day 4 H9 | 727104.3 | 42.68547 | 13.48812 | 172.0213 | 29591.32 |
| Heart 41 Day 6 H9 | 744472.3 | 43.70508 | 14.23815 | 176.5771 | 31179.46 |
| Heart 41 Day 10 H9 | 800151.2 | 46.97377 | 14.00636 | 311.0775 | 96769.22 |
| Heart 41 Day 15 H9 | 927299.3 | 54.43814 | 13.63959 | 425.1034 | 180712.93 |
| Heart 41 Day 20 H9 | 809279.0 | 47.50963 | 14.63262 | 230.8466 | 53290.15 |
| Heart 41 Day 40 H9 | 826540.3 | 48.52297 | 15.18484 | 211.2894 | 44643.22 |
| Heart 18 Day 0 MLC2V | 802319.7 | 47.10108 | 13.25450 | 158.2561 | 25045.00 |
| Heart 18 Day 2 MLC2V | 764355.6 | 44.87235 | 13.46746 | 169.3843 | 28691.04 |
| Heart 18 Day 4 MLC2V | 707001.2 | 41.50530 | 14.09808 | 150.1960 | 22558.85 |
| Heart 18 Day 6 MLC2V | 769018.6 | 45.14609 | 14.25261 | 169.0281 | 28570.48 |
| Heart 18 Day 10 MLC2V | 791369.4 | 46.45822 | 14.52728 | 219.7194 | 48276.59 |
| Heart 18 Day 15 MLC2V | 791232.2 | 46.45017 | 14.62889 | 233.2285 | 54395.55 |
| Heart 18 Day 20 MLC2V | 899519.9 | 52.80732 | 14.82049 | 244.6640 | 59860.47 |
| Heart 18 Day 40 MLC2V | 935508.8 | 54.92009 | 14.67489 | 361.6569 | 130795.68 |
| Heart 18 Day 60 MLC2V | 933355.7 | 54.79369 | 14.68259 | 421.3648 | 177548.34 |
| Heart 42 Day 0 MLC2V | 816656.5 | 47.94273 | 13.07401 | 196.8750 | 38759.78 |
| Heart 42 Day 2 MLC2V | 789418.7 | 46.34370 | 13.20376 | 183.5629 | 33695.34 |
| Heart 42 Day 4 MLC2V | 730009.3 | 42.85601 | 13.51457 | 162.4742 | 26397.86 |
| Heart 42 Day 6 MLC2V | 779092.2 | 45.73748 | 13.98386 | 184.5081 | 34043.25 |
| Heart 42 Day 10 MLC2V | 860421.6 | 50.51201 | 14.31508 | 232.9254 | 54254.26 |
| Heart 42 Day 15 MLC2V | 775917.9 | 45.55113 | 14.74511 | 236.1324 | 55758.50 |
| Heart 42 Day 20 MLC2V | 849629.4 | 49.87844 | 14.70218 | 252.2778 | 63644.10 |
| Heart 42 Day 40 MLC2V | 931343.7 | 54.67557 | 14.76092 | 407.5923 | 166131.48 |
| Heart 42 Day 60 MLC2V | 972308.9 | 57.08048 | 15.28911 | 496.6643 | 246675.41 |
| Patient 1 Left Ventricle Isolated Cells | 2097613.3 | 123.14273 | 12.71544 | 2054.7924 | 4222171.89 |
| Patient 1 Right Ventricle Isolated cells | 1801820.6 | 105.77789 | 12.64497 | 1652.3177 | 2730153.91 |
| Patient 2 Left Ventricle Isolated Cells | 2448968.4 | 143.76943 | 11.90376 | 2919.5824 | 8523961.52 |
| Patient 2 Right Ventricle Isolated Cells | 2338963.9 | 137.31149 | 11.48868 | 2609.5704 | 6809857.69 |
| Patient 3 Left Ventricle Isolated Cells | 1955785.5 | 114.81657 | 12.28001 | 1785.9598 | 3189652.30 |
| Patient 3 Right Ventricle Isolated Cells | 1841091.5 | 108.08333 | 12.63626 | 1779.0770 | 3165115.09 |
| Patient 4 Left Atria Tissue | 1289436.8 | 75.69783 | 15.15308 | 1052.6384 | 1108047.51 |
| Patient 4 Left Ventricle Tissue | 1993945.7 | 117.05681 | 14.52426 | 2527.4069 | 6387785.88 |
| Patient 4 Right Atria Tissue | 1289380.0 | 75.69449 | 14.74821 | 1068.7044 | 1142129.05 |
| Patient 4 Right Ventricle Tissue | 1729378.7 | 101.52511 | 13.97466 | 1684.3179 | 2836926.68 |
| Patient 5 Left Atria Tissue | 1374196.6 | 80.67374 | 14.72154 | 1212.6323 | 1470477.14 |
| Patient 5 Left Ventricle Tissue | 1610446.1 | 94.54304 | 14.15593 | 1607.2241 | 2583169.31 |
| Patient 5 Right Atria Tissue | 1383940.8 | 81.24579 | 15.08133 | 1388.4255 | 1927725.46 |
| Patient 5 Right Ventricle Tissue | 1649782.1 | 96.85230 | 14.02391 | 1703.4872 | 2901868.80 |
| Patient 6 Left Atria Tissue | 1432786.4 | 84.11333 | 14.46312 | 1383.8852 | 1915138.25 |
| Patient 6 Left Ventricle Tissue | 1393964.4 | 81.83424 | 14.18233 | 1300.1378 | 1690358.21 |
| Patient 6 Right Atria Tissue | 1273212.8 | 74.74538 | 14.75168 | 1055.1550 | 1113352.02 |
| Patient 6 Right Ventricle Tissue | 1460262.8 | 85.72636 | 14.21627 | 1364.0782 | 1860709.32 |
We will now use MDS plots, which involve using dimensional reduction techniques to observe clustering of our samples. (Ritchie et al. 2015)
par(xpd=T, mar=par()$mar+c(1.5,0,0,3))
# Manually specify the order of the conditions
ordered_conditions <- c("0", "2", "4", "6", "10", "15", "20", "40", "60", "LA", "LV", "RA", "RV")
# Create a factor with the specified order
conditions_factor <- factor(c(sample_info_1$day, sample_info_2$chamber), levels = ordered_conditions)
# Generate a color palette with the same number of colors as the number of conditions
n_colours <- length(ordered_conditions)
palette <- rainbow(n_colours)
# Plot the MDS with the ordered conditions and corresponding colors
limma::plotMDS(d, pch = 1, col = palette[conditions_factor])
# Add a title and legend
mtext("Figure 1: MDS Showing Clustering of Patient Samples\nFrom Temporally Clustered hPSC-Derived CMs", side=1, line=5)
legend("bottomright",
legend = ordered_conditions,
inset = c(-0.2, 0),
pch = c(1), col = palette, title = "Day/Chamber",
bty = 'n', cex = 0.75)
par(mar=c(5, 4, 4, 2) + 0.1)
We can see that there is strong clustering of all the patient-derived samples, and there’s also temporal clustering of all the stem cell-derived CMs. This is evidenced by the fact the from left to right, we see a ‘continuous’ clustering of cardiomyocytes from d60 to d0. We see that the patient samples all cluster together quite strongly, but further away in the firsta and second dimensions. This can likely be explained due to higher variance and low median counts in the patient data, but the fact that the cells came from warm cadavers, and the cell niche The range for the MDS Leading logFC dimensions are also quite high, which could be explained due to having two different types of data here.
We will now prefer differential gene expression analyses. Based on the results of the previously generated MDS plots, I think that important analyses would be: * Pairwise comparisons between each time point of the hPSC to cardiomyocyte differentiation + Specifically a comparison between day 0 and day 60 of the stem cell experiments - Comparison of the day 60 * hPSC-cardiomyocyte to the rest of the patient heart cells
We will determine the p-values of each of the experiment using edgeR. (Robinson, McCarthy, and Smyth 2010)
# 1. Prepare the data
# Truncate the sample names
converted_names <- gsub("Heart \\d+ Day (\\d+) .*", "day \\1", colnames(final_normalized_data))
converted_names <- gsub("Patient \\d+ (Left|Right) (Ventricle|Atria) .*", "\\1 \\2", converted_names)
converted_names <- gsub(" ", "", converted_names)
# Create a data frame with sample information
samples <- data.frame(converted_names)
# Convert the count data to a DGEList object
dge <- DGEList(counts = final_normalized_data)
# 2. Create the design matrix
# Create the design matrix with only 'day' as a factor
design <- model.matrix(~0 + factor(samples$converted_names))
# Rename the columns to be more interpretable
colnames(design) <- gsub("factor\\(samples\\$converted_names\\)", "", colnames(design))
dge <- estimateDisp(dge, design)
# 4. Fit the model
fit <- glmQLFit(dge, design)
Now, we will create the contrast matrices for comparisons.
# 5. Create contrast matrices
twoVS0con <- makeContrasts(
day2vsday0 = day2 - day0,
levels = design
)
fourVS2con <- makeContrasts(
day4vsday2 = day4 - day2,
levels = design
)
sixVS4con <- makeContrasts(
day6vsday4 = day6 - day4,
levels = design
)
tenVS6con <- makeContrasts(
day10vsday6 = day10 - day6,
levels = design
)
fifteenVS10con <- makeContrasts(
day15vsday10 = day15 - day10,
levels = design
)
twentyVS15con <- makeContrasts(
day20vsday15 = day20 - day15,
levels = design
)
fortyVS20con <- makeContrasts(
day40vsday20 = day40 - day20,
levels = design
)
sixtyVS40con <- makeContrasts(
day60vsday40 = day60 - day40,
levels = design
)
sixtyVS0con <- makeContrasts(
day60vsday0 = day60 - day0,
levels = design
)
lvVS60 <- makeContrasts(
LVvsday0 = LeftVentricle - day60,
levels = design
)
rvVS60 <- makeContrasts(
RVvsday60 = RightVentricle - day60,
levels = design
)
laVS60 <- makeContrasts(
LAvsday60 = LeftAtria - day60,
levels = design
)
raVS60 <- makeContrasts(
RAvsday60 = RightAtria - day60,
levels = design
)
lvVSrv <- makeContrasts(
lvvsrv = LeftVentricle - RightVentricle,
levels = design
)
Now, we can find differential expression between conditions
# 6. Perform the tests
twoVS0_lrt <- glmQLFTest(fit, contrast = twoVS0con)
fourVS2_lrt <- glmQLFTest(fit, contrast = fourVS2con)
sixVS4_lrt <- glmQLFTest(fit, contrast = sixVS4con)
tenVS6_lrt <- glmQLFTest(fit, contrast = tenVS6con)
fifteenVS10_lrt <- glmQLFTest(fit, contrast = fifteenVS10con)
twentyVS15_lrt <- glmQLFTest(fit, contrast = twentyVS15con)
fortyVS20_lrt <- glmQLFTest(fit, contrast = fortyVS20con)
sixtyVS40_lrt <- glmQLFTest(fit, contrast = sixtyVS40con)
sixtyVS0_lrt <- glmQLFTest(fit, contrast = sixtyVS0con)
lvVS60_lrt <- glmQLFTest(fit, contrast = lvVS60)
rvVS60_lrt <- glmQLFTest(fit, contrast = rvVS60)
laVS60_lrt <- glmQLFTest(fit, contrast = laVS60)
raVS60_lrt <- glmQLFTest(fit, contrast = raVS60)
lvVSrv_lrt <- glmQLFTest(fit, contrast = lvVSrv)
# 7. Extract results
results_twoVS0 <- topTags(twoVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fourVS2 <- topTags(fourVS2_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixVS4 <- topTags(sixVS4_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_tenVS6 <- topTags(tenVS6_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fifteenVS10 <- topTags(fifteenVS10_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_twentyVS15 <- topTags(twentyVS15_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fortyVS20 <- topTags(fortyVS20_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS40 <- topTags(sixtyVS40_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS0 <- topTags(sixtyVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVS60 <- topTags(lvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_rvVS60 <- topTags(rvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_laVS60 <- topTags(laVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_raVS60 <- topTags(raVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVSrv <- topTags(lvVSrv_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
I chose to use a cut-off of 0.01, in-line with the original study, and the limit the amount of genes that show up as significant, as there is a lot of temporal variation in stem cell differentiation protocols. In terms of the multiple hypothesis correction, I chose to use Benjamini-Hochberg for its robustness and that it seems quite standard in such analyses.
We will now present summary data in a table for each comparison.
summary_table <- data.frame(
Comparison = c("Day 2 VS Day 0",
"Day 4 VS Day 2",
"Day 6 VS Day 4",
"Day 10 VS Day 6",
"Day 15 VS Day 10",
"Day 20 VS Day 15",
"Day 40 VS Day 20",
"Day 60 VS Day 40",
"Day 60 VS Day 0",
"LV VS Day 60",
"RV VS Day 60",
"LA VS Day 60",
"RA VS Day 60",
"LV VS RV"
),
Significant = c(length(which(results_twoVS0$table$PValue < 0.01)),
length(which(results_fourVS2$table$PValue < 0.01)),
length(which(results_sixVS4$table$PValue < 0.01)),
length(which(results_tenVS6$table$PValue < 0.01)),
length(which(results_fifteenVS10$table$Value < 0.01)),
length(which(results_twentyVS15$table$PValue < 0.01)),
length(which(results_fortyVS20$table$PValue < 0.01)),
length(which(results_sixtyVS40$table$PValue < 0.01)),
length(which(results_sixtyVS0$table$PValue < 0.01)),
length(which(results_lvVS60$table$PValue < 0.01)),
length(which(results_rvVS60$table$PValue < 0.01)),
length(which(results_laVS60$table$PValue < 0.01)),
length(which(results_raVS60$table$PValue < 0.01)),
length(which(results_lvVSrv$table$PValue < 0.01))
),
Corrected = c(length(which(results_twoVS0$table$FDR < 0.01)),
length(which(results_fourVS2$table$FDR < 0.01)),
length(which(results_sixVS4$table$FDR < 0.01)),
length(which(results_tenVS6$table$FDR < 0.01)),
length(which(results_fifteenVS10$table$FDR < 0.01)),
length(which(results_twentyVS15$table$FDR < 0.01)),
length(which(results_fortyVS20$table$FDR < 0.01)),
length(which(results_sixtyVS40$table$FDR < 0.01)),
length(which(results_sixtyVS0$table$FDR < 0.01)),
length(which(results_lvVS60$table$FDR < 0.01)),
length(which(results_rvVS60$table$FDR < 0.01)),
length(which(results_laVS60$table$FDR < 0.01)),
length(which(results_raVS60$table$FDR < 0.01)),
length(which(results_lvVSrv$table$FDR < 0.01))
)
)
kable(summary_table, caption = "Table 5: The number of genes that show the number of significantly differentially expressed genes before and after FDR correction. For the temporal analysis, the first two comparisons hover around 1800 genes, the next two are at around 900, and then there's a steep drop off at the day 15 vs 10 and day 20 vs 15 comparisons. This picks back up. This picks back up intil days 60 to 40, showing minimal change, suggesting that they're in late stage maturation. Between day 60 and patient samples, many genes, all above 6000, are significantly differentially expressed. Between ventricles, no change is observed. ")
| Comparison | Significant | Corrected |
|---|---|---|
| Day 2 VS Day 0 | 3230 | 1895 |
| Day 4 VS Day 2 | 3006 | 1774 |
| Day 6 VS Day 4 | 2056 | 866 |
| Day 10 VS Day 6 | 2301 | 1006 |
| Day 15 VS Day 10 | 0 | 64 |
| Day 20 VS Day 15 | 1641 | 416 |
| Day 40 VS Day 20 | 2501 | 1163 |
| Day 60 VS Day 40 | 464 | 15 |
| Day 60 VS Day 0 | 9281 | 8768 |
| LV VS Day 60 | 10698 | 10290 |
| RV VS Day 60 | 10423 | 9999 |
| LA VS Day 60 | 7281 | 6360 |
| RA VS Day 60 | 7082 | 6126 |
| LV VS RV | 17 | 0 |
As expected, the comparison with the greatest number of differentially expressed genes was between day 60 and day 0, which compares an hPSC to a left ventricular cardiomyocyte derived from stem cells. Interestingly, we see a very small amount of differentially expressed genes for day 15 vs day 10 and day 60 and day 40. In terms of the patient sample comparisons, we see that all of them have a great number of genes with a significant values of differential expression, which indicates that even the mature hPSC-derived CM is very far from the patient by transcriptome. This could be explained due to the niche of the cell–cardiomyocytes in vivo are constantly beating, working to pump blood, and are also surrounded by other tissue types which secrete signals to regulate expression of different transcripts. In a dish, this isn’t the case, which could explain this phenomenon. We can investigate this more closely once we see which genes come out as top hits.
We will save our results in a list for ease of downstream analysis.
# List of all the results objects
results_list <- list(
results_twoVS0 = results_twoVS0,
results_fourVS2 = results_fourVS2,
results_sixVS4 = results_sixVS4,
results_tenVS6 = results_tenVS6,
results_fifteenVS10 = results_fifteenVS10,
results_twentyVS15 = results_twentyVS15,
results_fortyVS20 = results_fortyVS20,
results_sixtyVS40 = results_sixtyVS40,
results_sixtyVS0 = results_sixtyVS0,
results_lvVS60 = results_lvVS60,
results_rvVS60 = results_rvVS60,
results_laVS60 = results_laVS60,
results_raVS60 = results_raVS60,
results_lvVSrv = results_lvVSrv
)
# Titles for each plot
titles <- c(
"Day 2 vs Day 0",
"Day 4 vs Day 2",
"Day 6 vs Day 4",
"Day 10 vs Day 6",
"Day 15 vs Day 10",
"Day 20 vs Day 15",
"Day 40 vs Day 20",
"Day 60 vs Day 40",
"Day 60 vs Day 0",
"LV vs Day 60",
"RV vs Day 60",
"LA vs Day 60",
"RA vs Day 60",
"LV vs RV"
)
We will use volcano plots to show differentially expressed genes. To do this efficiently, we will make a function for making volcano plot. To determine top hits, we will filter by having an FDR of less that 0.01, and having an absolute logFC value of 2. This is in line with the paper’s protocol.
create_volcano_plot <- function(data, title, logFC_threshold = 2, FDR_threshold = 0.01, top_n = 5) {
# Add gene names as a column
data$Gene <- rownames(data)
# Add a column to indicate the direction of regulation
data$direction <- with(data, ifelse(logFC > logFC_threshold & FDR < FDR_threshold, "Upregulated",
ifelse(logFC < -logFC_threshold & FDR < FDR_threshold, "Downregulated", "Neutral")))
# Select top genes based on FDR and absolute logFC
top_genes <- data %>%
filter(direction %in% c("Upregulated", "Downregulated")) %>%
arrange(FDR, abs(logFC)) %>%
group_by(direction) %>%
slice_head(n = top_n) %>%
ungroup()
# Create the volcano plot
volcano_plot <- ggplot(data, aes(x = logFC, y = -log10(FDR), color = direction)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = title, x = "Log Fold Change", y = "-log10(FDR)") +
scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Neutral" = "grey")) +
theme(legend.title = element_blank()) +
geom_vline(xintercept = c(-logFC_threshold, logFC_threshold), linetype = "dashed", color = "black") +
geom_hline(yintercept = -log10(FDR_threshold), linetype = "dashed", color = "black") +
geom_text_repel(data = top_genes, aes(label = Gene), size = 3, nudge_y = 1)
return(volcano_plot)
}
Now, let’s plot
# Create a plot for each comparison
plots <- lapply(seq_along(results_list), function(i) {
create_volcano_plot(results_list[[i]]$table, titles[i])
})
# Display the plots
for (i in seq_along(plots)) {
print(plots[[i]])
}
Figure 2: Volcano plots for all comparisons. All are displayed for breadth of analysis. The top hits are also displayed. The interpretation for these results is discussed in the table heading for Table 5 and after.
We will now filter with the same parameters used to filter in the volcano plots.
# Filter each comparison based on the specified criteria
filtered_results_list <- lapply(results_list, function(result) {
result$table %>%
filter(FDR < 0.01, abs(logFC) > 2)
})
# Names for the filtered results
names(filtered_results_list) <- names(results_list)
filtered_results_display_df <- data.frame(
Num_Genes = sapply(filtered_results_list, nrow)
)
kable(filtered_results_display_df, caption = "Table 6: Number of genes meeting criteria for each comparison. Notice that because of the logFC conition, all values are necessarily equal to or smaller than the values in Table 5.")
| Num_Genes | |
|---|---|
| results_twoVS0 | 563 |
| results_fourVS2 | 616 |
| results_sixVS4 | 368 |
| results_tenVS6 | 251 |
| results_fifteenVS10 | 20 |
| results_twentyVS15 | 132 |
| results_fortyVS20 | 323 |
| results_sixtyVS40 | 9 |
| results_sixtyVS0 | 3720 |
| results_lvVS60 | 3836 |
| results_rvVS60 | 3762 |
| results_laVS60 | 2369 |
| results_raVS60 | 2289 |
| results_lvVSrv | 0 |
# Extract gene names from each filtered result and combine them into a single vector
union_gene_names <- unlist(lapply(filtered_results_list, function(filtered_result) {
rownames(filtered_result)
}))
# Remove duplicate gene names
all_genes <- unique(union_gene_names)
We will also need single genes for downstream analysis.
# Get only the unique genes for downstream analysis
gene_counts <- table(union_gene_names)
unique_genes <- names(gene_counts[gene_counts == 1])
We will now present our analysis, and specifically clustering of samples using a heatmap by using the Complex HeatMap library. (Gu 2022)
# Subset the normalized count matrix to keep only the genes of interest, ensure no NAs
heatmap_data <- final_normalized_data[all_genes, , drop = FALSE]
heatmap_data <- heatmap_data[complete.cases(heatmap_data), ]
# Scale the expression data
scaled_heatmap_data <- t(scale(t(heatmap_data)))
# Order the converted_names
ordered_converted_names <- unique(converted_names)
day6element <- ordered_converted_names[9]
ordered_converted_names <- ordered_converted_names[-9]
ordered_converted_names <- append(ordered_converted_names, day6element, after=3)
# Generate a color vector for conditions, using rainbow and setting the names attribute
condition_colors <- setNames(rainbow(length(ordered_converted_names)), ordered_converted_names)
# Create annotations for the heatmap
annotations_df <- data.frame(Condition = factor(converted_names, levels = ordered_converted_names))
# Create annotations for the heatmap
heatmap_annotations <- HeatmapAnnotation(Condition = annotations_df$Condition, col = list(Condition = condition_colors), show_legend = TRUE)
# Generate the heatmap
heatmap <- Heatmap(
as.matrix(scaled_heatmap_data),
top_annotation = heatmap_annotations,
cluster_rows = TRUE,
cluster_columns = TRUE, # Turn off column clustering to maintain the order of days and conditions
col = colorRamp2(c(-2, 0, 4), c("darkgreen", "white", "darkmagenta")), # Adjust colors as needed
show_column_names = FALSE, # Turn on to see the order of columns
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_title = "Figure 3: Gene Expression of all Significantly Differentially Expressed Genes"
)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` argument by explicitly setting
## TRUE/FALSE to it.
##
## Set `ht_opt$message = FALSE` to turn off this message.
# Draw the heatmap
draw(heatmap)
The conditions cluster somewhat well, but out of order temporally. This may be because many genes may be up and down-regulated during differentiation, and as a result, a gene coming on-off-on or vice-versa may result in different timepoint clustering together exclusively through transcriptomes and not temporally. Interestingly, the patient samples cluster well together as expected, but they are adjacent to day 6, which is unusual. My best explanation may be because of variance, but I will try to better cluster these using unique genes for each condition instead. By doing this, we get rid of genes that are significant more than once. This way, the genes act as better ‘signatures’ of each condition.
# Subset the normalized count matrix to keep only the genes of interest, ensure no NAs
unique_heatmap_data <- final_normalized_data[unique_genes, , drop = FALSE]
unique_heatmap_data <- unique_heatmap_data[complete.cases(unique_heatmap_data), ]
# Scale the expression data
unqiue_scaled_heatmap_data <- t(scale(t(unique_heatmap_data)))
# Generate the heatmap
unique_heatmap <- Heatmap(
as.matrix(unqiue_scaled_heatmap_data),
top_annotation = heatmap_annotations,
cluster_rows = TRUE,
cluster_columns = TRUE, # Turn off column clustering to maintain the order of days and conditions
col = colorRamp2(c(-2, 0, 4), c("darkgreen", "white", "darkmagenta")), # Adjust colors as needed
show_column_names = FALSE, # Turn on to see the order of columns
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_title = "Figure 4: Gene Expression of Unique Significantly Differentially Expressed Genes"
)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` argument by explicitly setting
## TRUE/FALSE to it.
##
## Set `ht_opt$message = FALSE` to turn off this message.
# Draw the heatmap
draw(unique_heatmap)
This time, we better clustering, and ‘colour-blocking’ in the matrix
itself. Once again, the patient samples are sandwiched between days 6
and 10. This may be because the normalization technique used may not
have been as appropriate for such analysis, and significant noise may be
in samples. In the future, I may incorporate a library like noisyR (Moutsopoulos et al. 2021)
Here, we will define up and down-regulated gene sets. The goal of this function block was to create a general overall list of genes that were up and down-regulated for each condition, and to create a pooled one as well. In terms of significance values, we’ll use a cut-off of FDR < 0.05 and the magnitude of logFC being 0. These are the same parameters the authors used in their analysis.
ora_filtered_results_list <- lapply(results_list, function(result) {
result$table %>%
filter(FDR < 0.05)
})
up_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
result %>%
filter(logFC > 0)
})
up_regulated_results_list <- lapply(up_regulated_results_list, function(result) {
rownames(result)
})
down_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
result %>%
filter(logFC < 0)
})
down_regulated_results_list <- lapply(down_regulated_results_list, function(result) {
rownames(result)
})
all_up_regulated_genes <- unique(unlist(up_regulated_results_list))
all_down_regulated_genes <- unique(unlist(down_regulated_results_list))
Let’s set up the g:Profiler for all of the aforementioned combinations (per condition, for significant, up and down-regulated genes). (Reimand et al. 2007) We will be using the g:Profiler since it’s a popular and robust tool for ORA, and it was also a technique learned in a previous homework assignment. The paper only uses the GO BP annotation dataset, and we will be doing the same. This dataset …
go_upreg <- gost(
query = all_up_regulated_genes,
organism = "hsapiens",
sources = c("GO:BP")
)
go_downreg <- gost(
query = all_down_regulated_genes,
organism = "hsapiens",
sources = c("GO:BP")
)
go_updownreg <- gost(
query = list(Downregulated = all_down_regulated_genes, Upregulated = all_up_regulated_genes),
organism = "hsapiens",
sources = c("GO:BP")
)
Let’s look at the results of these:
numerical_go_summary_df <- data.frame(
Analysis <- c("Up and Down-Regulated", "Up-Regulated", "Down-Regulated"),
Pathways <- c(
nrow(go_updownreg$result),
nrow(go_upreg$result),
nrow(go_downreg$result)
)
)
kable(numerical_go_summary_df, col.names = c("Analysis", "Number of Pathways"), caption = "Table 7: The number of pathways that are up/down-regulated for each condition, and when they're pooled together. Notice that the the when analyzed together, the number of pathways are the sum of te individual conditions.")
| Analysis | Number of Pathways |
|---|---|
| Up and Down-Regulated | 3150 |
| Up-Regulated | 1604 |
| Down-Regulated | 1546 |
We can see that running the analysis together or separately still yields the same number of total pathways being recognized.
Let’s look at a few up-regulated terms.
We will filter pathways with greater than 1000 terms, since they could be ‘ubiquitous’ and yield false positives. Start with up-regulation.
filtered_go_upreg <- go_updownreg$result[
which(go_updownreg$result$query == "Upregulated" & go_updownreg$result$term_size <= 1000 ),]
kable(filtered_go_upreg[1:10, c(11, 3, 1)],
caption = "Table 8: Top 10 upregulated pathways. The BH values show up as zero since they are so small.",
col.names = c("Pathway", "P Value (BH)","Direction")
)
| Pathway | P Value (BH) | Direction | |
|---|---|---|---|
| 1584 | tube morphogenesis | 0 | Upregulated |
| 1585 | heart development | 0 | Upregulated |
| 1594 | vasculature development | 0 | Upregulated |
| 1597 | muscle structure development | 0 | Upregulated |
| 1599 | enzyme-linked receptor protein signaling pathway | 0 | Upregulated |
| 1605 | blood vessel development | 0 | Upregulated |
| 1613 | regulation of cell migration | 0 | Upregulated |
| 1615 | regulation of cell motility | 0 | Upregulated |
| 1619 | blood vessel morphogenesis | 0 | Upregulated |
| 1622 | circulatory system process | 0 | Upregulated |
And now down-regulation.
filtered_go_downreg <- go_updownreg$result[
which(go_updownreg$result$query == "Downregulated" & go_updownreg$result$term_size <= 1000 ),]
kable(filtered_go_downreg[1:10, c(11, 3, 1)],
caption = "Table 9: Top 10 downregulated pathways. The BH values show up as zero since they are so small.",
col.names = c("Pathway", "P Value (BH)","Direction")
)
| Pathway | P Value (BH) | Direction | |
|---|---|---|---|
| 71 | mitotic cell cycle | 0 | Downregulated |
| 73 | DNA damage response | 0 | Downregulated |
| 78 | mitotic cell cycle process | 0 | Downregulated |
| 93 | cell morphogenesis | 0 | Downregulated |
| 98 | ncRNA metabolic process | 0 | Downregulated |
| 101 | DNA repair | 0 | Downregulated |
| 102 | intracellular protein transport | 0 | Downregulated |
| 103 | protein catabolic process | 0 | Downregulated |
| 107 | cell division | 0 | Downregulated |
| 111 | ribosome biogenesis | 0 | Downregulated |
We will extract up and down-regualted genes using the same conditions for all conditions.
# Initialize lists to hold the gene sets
upregulated_genesets <- list()
downregulated_genesets <- list()
# Iterate over the filtered results list
for (condition_name in names(ora_filtered_results_list)) {
# Extract the data frame for the condition
condition_df <- ora_filtered_results_list[[condition_name]]
# Filter for upregulated genes (FDR < 0.05 and logFC > 0)
upregulated_genes <- condition_df %>%
filter(FDR < 0.05, logFC > 0) %>%
rownames()
# Filter for downregulated genes (FDR < 0.05 and logFC < 0)
downregulated_genes <- condition_df %>%
filter(FDR < 0.05, logFC < 0) %>%
rownames()
# Add to the lists
upregulated_genesets[[condition_name]] <- upregulated_genes
downregulated_genesets[[condition_name]] <- downregulated_genes
}
First, we’ll look at the temporal comparisons.
# Initialize an empty list to hold the GO analysis for each comparison
days_analysis_list <- vector("list", 9)
# Iterate over each comparison
for (i in 1:9) {
cond <- names(ora_filtered_results_list)[i]
# Perform g:Profiler analysis for upregulated and downregulated genes separately
go_analysis <- gost(
query = list(Downregulated = downregulated_genesets[[cond]],
Upregulated = upregulated_genesets[[cond]]),
organism = "hsapiens",
sources = c("GO:BP")
)
# Add the results to the list
days_analysis_list[[cond]] <- go_analysis
}
# Remove the first 9 empty entries
days_analysis_list <- days_analysis_list[-(1:9)]
And now the patient ones.
# Initialize an empty list to hold the GO analysis for each comparison
patient_analysis_list <- vector("list", 4)
# Iterate over each comparison
for (i in 10:13) {
cond <- names(ora_filtered_results_list)[i]
# Perform g:Profiler analysis for upregulated and downregulated genes separately
go_analysis <- gost(
query = list(Downregulated = downregulated_genesets[[cond]],
Upregulated = upregulated_genesets[[cond]]),
organism = "hsapiens",
sources = c("GO:BP")
)
# Add the results to the list
patient_analysis_list[[cond]] <- go_analysis
}
# Remove the first 4 empty entries
patient_analysis_list <- patient_analysis_list[-(1:4)]
We will now display both.
# Initialize an empty data frame to hold the top pathways for each condition and direction
top_pathways_df <- data.frame(matrix(ncol = length(names(days_analysis_list)) * 2, nrow = 10))
# Set the column names for the data frame
colnames(top_pathways_df) <- paste(rep(names(days_analysis_list), each = 2), rep(c("Upregulated", "Downregulated"), times = length(names(days_analysis_list))), sep = "_")
# Fill the data frame with the top pathways
for (cond in names(days_analysis_list)) {
# Extract the top 10 upregulated pathways
top_upreg <- days_analysis_list[[cond]]$result %>%
filter(query == "Upregulated", term_size <= 1000) %>%
arrange(p_value) %>%
slice(1:10) %>%
.$term_name
# Ensure that we have exactly 10 entries by padding with NA if necessary
top_upreg <- top_upreg[1:10]
if (length(top_upreg) < 10) {
top_upreg <- c(top_upreg, rep(NA, 10 - length(top_upreg)))
}
# Extract the top 10 downregulated pathways
top_downreg <- days_analysis_list[[cond]]$result %>%
filter(query == "Downregulated", term_size <= 1000) %>%
arrange(p_value) %>%
slice(1:10) %>%
.$term_name
# Ensure that we have exactly 10 entries by padding with NA if necessary
top_downreg <- top_downreg[1:10]
if (length(top_downreg) < 10) {
top_downreg <- c(top_downreg, rep(NA, 10 - length(top_downreg)))
}
# Add the pathways to the data frame
top_pathways_df[paste(cond, "Upregulated", sep = "_")] <- top_upreg
top_pathways_df[paste(cond, "Downregulated", sep = "_")] <- top_downreg
}
# Create the kable
kable(top_pathways_df, caption = "Table 10: Top 10 upregulated and downregulated pathways for each temporal comparison. Intepretation of these results are in the Interpretation section.", align = 'c')
| results_twoVS0_Upregulated | results_twoVS0_Downregulated | results_fourVS2_Upregulated | results_fourVS2_Downregulated | results_sixVS4_Upregulated | results_sixVS4_Downregulated | results_tenVS6_Upregulated | results_tenVS6_Downregulated | results_fifteenVS10_Upregulated | results_fifteenVS10_Downregulated | results_twentyVS15_Upregulated | results_twentyVS15_Downregulated | results_fortyVS20_Upregulated | results_fortyVS20_Downregulated | results_sixtyVS40_Upregulated | results_sixtyVS40_Downregulated | results_sixtyVS0_Upregulated | results_sixtyVS0_Downregulated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| regulation of anatomical structure morphogenesis | neuron projection morphogenesis | heart development | ribosome biogenesis | heart development | DNA replication | muscle system process | chromosome organization | muscle system process | cellular response to growth factor stimulus | actin cytoskeleton organization | cytoplasmic translation | external encapsulating structure organization | mitotic cell cycle | NA | fructose metabolic process | muscle structure development | ribosome biogenesis |
| embryonic morphogenesis | plasma membrane bounded cell projection morphogenesis | muscle structure development | ncRNA metabolic process | muscle structure development | DNA-templated DNA replication | muscle contraction | mitotic cell cycle | striated muscle contraction | response to growth factor | actin filament-based process | peptide metabolic process | extracellular matrix organization | mitotic cell cycle process | NA | carbohydrate phosphorylation | heart development | ncRNA metabolic process |
| tissue morphogenesis | cell projection morphogenesis | tissue morphogenesis | rRNA metabolic process | muscle system process | DNA repair | striated muscle contraction | mitotic cell cycle process | actin filament-based movement | regulation of cell migration | supramolecular fiber organization | translation | extracellular structure organization | chromosome organization | NA | fructose 2,6-bisphosphate metabolic process | tube morphogenesis | ncRNA processing |
| heart development | cellular anatomical entity morphogenesis | heart morphogenesis | rRNA processing | muscle tissue development | DNA damage response | oxoacid metabolic process | chromosome segregation | cardiac muscle contraction | NA | cell junction organization | peptide biosynthetic process | cellular response to growth factor stimulus | chromosome segregation | NA | monosaccharide metabolic process | vasculature development | translation |
| head development | cell morphogenesis | mesenchyme development | ncRNA processing | striated muscle tissue development | chromatin organization | organic acid metabolic process | nuclear chromosome segregation | regulation of muscle system process | NA | apoptotic signaling pathway | amide biosynthetic process | circulatory system process | mitotic nuclear division | NA | NA | blood vessel development | rRNA metabolic process |
| gastrulation | cell morphogenesis involved in neuron differentiation | enzyme-linked receptor protein signaling pathway | ribosomal small subunit biogenesis | cardiac muscle tissue development | protein-DNA complex organization | carboxylic acid metabolic process | cell division | actin-mediated cell contraction | NA | enzyme-linked receptor protein signaling pathway | transmembrane receptor protein serine/threonine kinase signaling pathway | response to growth factor | organelle fission | NA | NA | muscle tissue development | amide biosynthetic process |
| morphogenesis of an epithelium | chemical synaptic transmission | tube morphogenesis | ribosomal large subunit biogenesis | muscle contraction | chromosome organization | generation of precursor metabolites and energy | nuclear division | muscle contraction | NA | cell morphogenesis | transforming growth factor beta receptor superfamily signaling pathway | regulation of system process | nuclear division | NA | NA | muscle cell differentiation | peptide biosynthetic process |
| tube morphogenesis | anterograde trans-synaptic signaling | mesenchymal cell differentiation | maturation of SSU-rRNA | actin filament-based process | double-strand break repair | monocarboxylic acid metabolic process | sister chromatid segregation | cardiac muscle cell contraction | NA | ameboidal-type cell migration | enzyme-linked receptor protein signaling pathway | enzyme-linked receptor protein signaling pathway | mitotic sister chromatid segregation | NA | NA | striated muscle tissue development | rRNA processing |
| embryo development ending in birth or egg hatching | trans-synaptic signaling | embryonic morphogenesis | peptide metabolic process | muscle cell differentiation | chromatin remodeling | heart process | regulation of cell cycle process | regulation of system process | NA | cell adhesion mediated by integrin | protein-lipid complex remodeling | camera-type eye development | sister chromatid segregation | NA | NA | cardiac muscle tissue development | DNA damage response |
| chordate embryonic development | regulation of cell projection organization | cellular anatomical entity morphogenesis | peptide biosynthetic process | striated muscle cell differentiation | negative regulation of transcription by RNA polymerase II | cardiac muscle contraction | DNA replication | regulation of muscle adaptation | NA | response to wounding | plasma lipoprotein particle remodeling | sensory system development | nuclear chromosome segregation | NA | NA | blood vessel morphogenesis | chromosome organization |
# Initialize an empty data frame to hold the top pathways for each condition and direction
top_patient_pathways_df <- data.frame(matrix(ncol = length(names(patient_analysis_list)) * 2, nrow = 10))
# Set the column names for the data frame
colnames(top_patient_pathways_df) <- paste(rep(names(patient_analysis_list), each = 2), rep(c("Upregulated", "Downregulated"), times = length(names(patient_analysis_list))), sep = "_")
# Fill the data frame with the top pathways
for (cond in names(patient_analysis_list)) {
# Extract the top 10 upregulated pathways
top_upreg <- patient_analysis_list[[cond]]$result %>%
filter(query == "Upregulated", term_size <= 1000) %>%
arrange(p_value) %>%
slice(1:10) %>%
.$term_name
# Ensure that we have exactly 10 entries by padding with NA if necessary
top_upreg <- top_upreg[1:10]
if (length(top_upreg) < 10) {
top_upreg <- c(top_upreg, rep(NA, 10 - length(top_upreg)))
}
# Extract the top 10 downregulated pathways
top_downreg <- patient_analysis_list[[cond]]$result %>%
filter(query == "Downregulated", term_size <= 1000) %>%
arrange(p_value) %>%
slice(1:10) %>%
.$term_name
# Ensure that we have exactly 10 entries by padding with NA if necessary
top_downreg <- top_downreg[1:10]
if (length(top_downreg) < 10) {
top_downreg <- c(top_downreg, rep(NA, 10 - length(top_downreg)))
}
# Add the pathways to the data frame
top_patient_pathways_df[paste(cond, "Upregulated", sep = "_")] <- top_upreg
top_patient_pathways_df[paste(cond, "Downregulated", sep = "_")] <- top_downreg
}
# Create the kable
kable(top_patient_pathways_df, caption = "Table 11: Top 10 upregulated and downregulated pathways for each patient comparison. Intepretation of these results are in the Interpretation section.", align = 'c')
| results_lvVS60_Upregulated | results_lvVS60_Downregulated | results_rvVS60_Upregulated | results_rvVS60_Downregulated | results_laVS60_Upregulated | results_laVS60_Downregulated | results_raVS60_Upregulated | results_raVS60_Downregulated |
|---|---|---|---|---|---|---|---|
| cellular respiration | mitotic cell cycle | cellular respiration | mitotic cell cycle | cellular respiration | mitotic cell cycle process | oxidative phosphorylation | mitotic cell cycle process |
| aerobic respiration | DNA damage response | aerobic respiration | mitotic cell cycle process | aerobic respiration | mitotic cell cycle | cellular respiration | mitotic cell cycle |
| energy derivation by oxidation of organic compounds | mitotic cell cycle process | energy derivation by oxidation of organic compounds | DNA damage response | oxidative phosphorylation | cell division | aerobic respiration | cell division |
| generation of precursor metabolites and energy | cell division | generation of precursor metabolites and energy | cell division | mitochondrial ATP synthesis coupled electron transport | DNA damage response | respiratory electron transport chain | DNA damage response |
| oxidative phosphorylation | DNA repair | oxidative phosphorylation | DNA repair | ATP synthesis coupled electron transport | protein-DNA complex organization | mitochondrial ATP synthesis coupled electron transport | protein-DNA complex organization |
| respiratory electron transport chain | regulation of cell cycle process | respiratory electron transport chain | regulation of cell cycle process | aerobic electron transport chain | chromosome organization | ATP synthesis coupled electron transport | chromosome organization |
| aerobic electron transport chain | cell cycle phase transition | ATP synthesis coupled electron transport | cell cycle phase transition | electron transport chain | regulation of cell cycle process | electron transport chain | regulation of cell cycle process |
| ATP synthesis coupled electron transport | chromosome organization | mitochondrial ATP synthesis coupled electron transport | chromosome organization | respiratory electron transport chain | cell cycle phase transition | aerobic electron transport chain | DNA repair |
| mitochondrial ATP synthesis coupled electron transport | protein-DNA complex organization | aerobic electron transport chain | protein-DNA complex organization | energy derivation by oxidation of organic compounds | chromatin organization | energy derivation by oxidation of organic compounds | chromatin organization |
| electron transport chain | protein modification by small protein conjugation | electron transport chain | regulation of cellular component biogenesis | generation of precursor metabolites and energy | DNA repair | generation of precursor metabolites and energy | cell cycle phase transition |
It’s quite complicated to say whether or not our data supports the conclusions made by the original paper. The reason for this is because the authors were using other datasets to compare differentiation protocols to see whether or not using a Wnt inhibitor while directing differentiation into cardiomyocytes would cause their hPSCs to be more akin to adult LV cardiomyocytes. Our data and analysis is restricted to just this dataset, so all we can comment on is what we found. With this in mind, we can begin to answer this question. Our pairwise day comparisons we indeed see many of the same terms being upregulated in Figure 2C of our article. These include terms like muscle tissue development, actin-mediated cell contraction, and more. Additionally, we also get some similar terms for down-regulation, including ncRNA metabolic process and ribosome biogenesis. Together, we see that our analysis concurs with the research group’s, such that the protocol indeed up-regulated pathways related to creating cardiomyocytes and downregulated ‘general’ and ‘stem-my’ pathways as it differentiated. The other main analysis that we did was compare the most mature hPSC, our day60 one, to patient samples for the LV, RV, LA, and RA. Previously, we saw that there were no pathways that were up/down-regulated for LV and RV suggesting that they are quite similar to each other in terms of transciptome. Across all conditions, we see that pathways related to cellular respiration, oxidative phosphorylation, the mitotic cell cycle, DNA repair and regulation, and chromatin organization are upregulated in the patient samples compared to the hPSC-derived CM. This is likely because these samples came from cells that were from cadavers, and as a result, were cells that were involved in heart function and likely had higher metabolic demands in order to function in the heart.
The article’s main goal was to show that a methodology could be derived to improve the differentiation of hPSCs into cardiomyocytes more effectively, and their choice was to using CHIR to inhibit Wnt. When the authors compared their dataset to Cyganek et al.’s, they actually found that the other group’s day 90 compared favourably to our group’s day 60, indicating that our group’s differentiation protocol was good at making mature cardiomyocytes more quickly. Now, for our analysis, it would have been nice for us to have been able to compare our analysis to the other group’s day 90 to see if we had similar ORA results. Additionally, in a study looking at NOTCH1 knockouts by Ye et al., they also conducted a overrepresentation analysis, and in their day 10 cardiomyocytes, they had a similar result where pathways upregulating heart muscle development were favoured, and pathways involved in the cell cycle were downregulated. Together, this increases the confidence that our results show a robust differentiation protocol of hPSCs in cardiomyocytes.
Include a brief introduction section with a summary of the normalization and results done in the first assignment. Assume that the person reading the report has not read your assignment #1 report. Including basic statistics from that analysis will be helpful. Assignment 1 Recap
Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why? Determining p-Values
Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction? Determining p-Values
Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest. Volcano Plots
Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not. Heat Map
Which method did you choose and why? Setting up the g:Profiler
What annotation data did you use and why? What version of the annotation are you using? Setting up the g:Profiler
How many genesets were returned with what thresholds? Analyzing All Up and Down-Regulated Genes, Pooled
Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)? This is addressed in all of the section due to the unique nature of my analysis. Over-Representation analysis
Do the over-representation results support conclusions or mechanism discussed in the original paper? Interpretation
Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results. Interpretation